ggplot2
Duration - 2 hours 30 minutes
Data visualisation is a key term in data science that you may already be familiar with. A common definition would be:
Data visualisation is the graphical representation of information and data. By using visual elements like charts, graphs and maps, data visualisation tools provide an accessible way to see and understand trends, outliers and patterns in data. In the world of data, data visualisation tools and technologies are essential for analysing massive amounts of information and making data-driven decisions.
In essence, data visualisation is the process of taking data and translating and communicating it in graphical form. Humans are exceptionally good at processing and recalling visual data. Long lists of numbers and big blocks of text - less so. So we can use visualisation to better understand the data we work with, and we can also use it to communicate data to others.
Why is this so important? Take a look at the graph below.
Task - 2 mins
Do you think this graph works well as an explanatory graph? How could it be improved?
Answer
No, it’s far too complicated, for many reasons. It needs to focus on a few ideas and present the data in a clear way.
From the above chart, you can probably see why good data visualisation is both extremely hard, and extremely important. There are some general guidelines to follow when creating informative plots:
Yes, some of these may seem like rudimentary graphing skills, but you’d be amazed at how many people manage to miss them (especially when using R, as you generally have to add them separately).
Task - 5 mins
Explain why each of these charts misrepresents the data.
![]()
![]()
Answer
- In the first graph, the heights of the bars clearly don’t correspond to the relative values.
- The second graph’s main sin is that it inverts the y-axis: without studying the graph closely you might imagine gun deaths fell when in fact they rose.
- The third graph’s issue is perhaps the subtlest of the three. The heights of the objects correctly correspond to the relative values, but the widths have been scaled up too. If you increase height and width by a factor of 3 (say) then the area increases by a factor of 3 x 3 = 9, giving a misleading impression to the reader.
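The area arithmetic behind the third chart can be checked in a couple of lines. This is only an illustrative sketch; the variable names are made up for the example.

```r
# Scaling both height and width by the same factor multiplies area by its square
scale_factor <- 3
area_ratio <- scale_factor^2

# If the area (rather than the height) should carry the value,
# each side would only need to grow by the square root of the factor
side_scale_for_fair_area <- sqrt(scale_factor)

area_ratio
side_scale_for_fair_area^2
```

So a value that is three times larger ends up drawn with nine times the ink, which is why bars of fixed width are usually the safer choice.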
Now we know what we don’t want, let’s get started.
ggplot2
ggplot2 is an R package (part of the tidyverse) that enables you to create data visualisations. You can use it to create everything from simple bar graphs all the way up to detailed maps. It can take a while to get your head around how it works, but once you understand it I’m sure you’ll find ggplot2 very powerful.
ggplot2 “grammar”
There are four main parts of the basic ggplot2 call:
image from sharpsightlabs.com
- ggplot(). This initiates plotting. That’s all it does. Within this function you include the data you are plotting.
- The aes() function. This allows you to choose which parts of your data you are going to plot. More precisely, the aes() function allows you to map the variables in your data frame to the aesthetic attributes of the geometric objects of your plot.
- The + operator. This adds each further component of the plot to the ggplot() call.
- A geom, such as geom_bar(). More precisely, these are called the geoms of the plot: short for the geometric shapes you’ll use. Lines, points, and bars are all types of geoms. This will become clearer once you start using it more.
image from sharpsightlabs.com
And that’s pretty much it! You can build commands on top of each other to produce more and more complex plots, but ultimately you just need this simple starting block and can repeat the syntax. This syntax flow is highly structured, which is where the name ‘ggplot’ comes from: it is short for ‘grammar of graphics plot’. Like the grammar of a language, ggplot2 has a consistent and structured workflow. This structured nature is one of its best features.
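As a rough sketch of how the four parts fit together, here is a tiny plot built on mtcars, a dataset that ships with R. The dataset and the choice of variable are assumptions made purely for illustration; they are not part of the lesson data.

```r
library(ggplot2)

# mtcars is a built-in dataset, used here only as a stand-in
p <- ggplot(mtcars) +        # ggplot() initiates the plot and takes the data
  aes(x = factor(cyl)) +     # aes() maps a variable to the x aesthetic
  geom_bar()                 # the geom draws bars; each + adds a component

p
```

Swapping geom_bar() for another geom, or adding another + line, changes the plot without changing the overall shape of the call.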
Task - 2 mins
Identify the geometric objects and aesthetic mappings used in each of these plots.
Answer
- Geoms: Lines!
- Aesthetics: x = year, y = intake, colour = food type
ggplot2()
If you haven’t already, install ggplot2. We’ll be using data from the CodeClanData package, so load that along with ggplot2.
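install.packages("ggplot2") # only needed once per machine
library(ggplot2)
library(dplyr) # provides %>%, arrange() and group_by(), which are used later on
library(CodeClanData)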
We will plot the preferred superpowers from our students dataset.
Translating this into ggplot syntax gives us:
ggplot(students) +
  geom_bar(aes(x = superpower))
So the first call is always to ggplot() and the first argument is always the dataset. Next, we include the type of geom we want, geom_bar(), along with the aes() terms we want. In our case, we are plotting superpower on the x axis.
You might be wondering why we don’t have anything specified for the y axis? In this case, it is because geom_bar() is programmed to automatically count the data within your chosen variable (here, superpower) as bar graphs display frequency or count data. Although it is basic, it’s a semi-decent graph in just two lines of code.
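To see the automatic counting for yourself, you can compare the heights ggplot2 computes with a manual count. The sketch below uses a small made-up data frame (toy) rather than the students data, and relies on layer_data(), which returns the values ggplot2 calculated for a layer.

```r
library(ggplot2)

# Hypothetical toy data, standing in for the students dataset
toy <- data.frame(
  superpower = c("Fly", "Fly", "Invisibility", "Telepathy", "Fly")
)

p <- ggplot(toy) +
  geom_bar(aes(x = superpower))

# Counts computed by geom_bar() for each bar
bar_counts <- layer_data(p)$count

# The same counts worked out by hand
manual_counts <- as.vector(table(toy$superpower))

bar_counts
manual_counts
```

Both vectors hold the same numbers, which is why no y aesthetic is needed.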
Inside aes and outside aes
Say we want to colour the bars in. This colouring doesn’t depend on a specific variable from the data, so it goes outside aes.
ggplot(students) +
  geom_bar(aes(x = superpower), fill = "light blue")
You can also colour your bars by variables in the data. Let’s say we want to colour the bars in by the year each student is in. We can do this by mapping fill to the school year, inside aes.
ggplot(students) +
  geom_bar(aes(x = superpower, fill = school_year))
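A common slip is to put a fixed colour inside aes. The sketch below, on a made-up data frame, shows what probably happens in that case: ggplot2 treats the colour name as a data value with a single category, picks its own default colour and adds a legend.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(superpower = c("Fly", "Fly", "Invisibility"))

# Outside aes(): a fixed setting, so every bar is drawn in this colour
p_set <- ggplot(toy) +
  geom_bar(aes(x = superpower), fill = "steelblue")

# Inside aes(): "steelblue" is treated as a data value, not a colour
p_mapped <- ggplot(toy) +
  geom_bar(aes(x = superpower, fill = "steelblue"))

# Fill colours actually used for the bars in each plot
unique(layer_data(p_set)$fill)
unique(layer_data(p_mapped)$fill)
```

If your bars come out in an unexpected colour with a one-item legend, this is usually the cause.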
Looking fine. But what if you hate the default colours? Where is the GUI option to change that around? Thankfully the ggplot2 designers have thought of that and given you some different ways to do that.
The first is to do it manually using the scale_fill_manual() function:
ggplot(students) +
  geom_bar(aes(x = superpower, fill = school_year)) +
  scale_fill_manual(values = c("red", "yellow", "orange", "pink", "purple"))
Not the best choice, but it did work…
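If you want control over which group gets which colour, scale_fill_manual() also accepts a named vector. This is a minimal sketch on made-up data; the year labels ("Year 7", "Year 8") are assumptions and may not match the levels in the students data.

```r
library(ggplot2)

# Hypothetical toy data with two school years
toy <- data.frame(
  superpower  = c("Fly", "Fly", "Invisibility", "Invisibility"),
  school_year = c("Year 7", "Year 8", "Year 7", "Year 8")
)

# Names must match the values of the variable mapped to fill
year_colours <- c("Year 7" = "orange", "Year 8" = "purple")

p <- ggplot(toy) +
  geom_bar(aes(x = superpower, fill = school_year)) +
  scale_fill_manual(values = year_colours)

p
```

Naming the colours keeps the mapping stable even if the order of the groups changes.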
Thankfully, you can also scale the colours by using other automatic color scales, such as ones taken from the RColorBrewer package. For example, their default is:
ggplot(students) +
  geom_bar(aes(x = superpower, fill = school_year)) +
  scale_fill_brewer()
with lots of other themes available to choose from.
ggplot(students) +
  geom_bar(aes(x = superpower, fill = school_year)) +
  scale_fill_brewer(palette = "Pastel2")
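Palette names have to match exactly, and an unknown name only produces a warning. One way to check what is available is to look at the palette table in the RColorBrewer package, sketched below (this assumes RColorBrewer is installed, which is normally the case once ggplot2 is).

```r
library(RColorBrewer)

# brewer.pal.info has one row per palette; the row names are the palette names
palette_names <- rownames(brewer.pal.info)

# Which palettes have "Pastel" in their name?
pastel_palettes <- palette_names[grepl("Pastel", palette_names)]
pastel_palettes

# display.brewer.all() would draw every palette so you can pick one by eye
```

Any name from palette_names can be passed to the palette argument of scale_fill_brewer().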
While this graph is starting to border on one that’s potentially a tad confusing, it just shows you how easy it is to create visualisations from your data in a couple of lines of code.
When you make a bar chart with fill, by default the bars are stacked on top of each other as above. But we can change this by changing the position argument in geom_bar. We can use position = "dodge" to change the bars to be side by side.
ggplot(students) +
  geom_bar(aes(x = superpower, fill = school_year), position = "dodge") +
  scale_fill_brewer()
Or position = "fill" to make each bar the same height, and let the colour show the relative proportions. For this, we want to change the fill colours so we use scale_fill_brewer() instead of scale_colour_brewer().
ggplot(students) +
  geom_bar(aes(x = superpower, fill = school_year), position = "fill") +
  scale_fill_brewer()
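To confirm what position = "fill" does, the sketch below checks that every stacked bar tops out at 1, so the segments can be read as proportions. The data frame is made up for the example.

```r
library(ggplot2)

# Hypothetical toy data: three students want to fly, one wants invisibility
toy <- data.frame(
  superpower  = c("Fly", "Fly", "Fly", "Invisibility"),
  school_year = c("Year 7", "Year 8", "Year 8", "Year 7")
)

p_fill <- ggplot(toy) +
  geom_bar(aes(x = superpower, fill = school_year), position = "fill")

# ymax is the top edge of each bar segment after stacking and rescaling
d <- layer_data(p_fill)
bar_tops <- tapply(d$ymax, as.numeric(d$x), max)
bar_tops
```

Because every bar has the same height, this position is good for comparing proportions but hides the raw counts.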
This graph is looking alright now, and hopefully it is clear that you can start with a basic plot in ggplot2, and continually add and tweak different parts of it in order for it to look how you want it.
As we said above, the default geom_bar is actually doing a statistical transformation for us: it’s counting the number in each group and using that to make the bar. This is the “count” statistic in ggplot2. If you specify stat = "count" in geom_bar, our plot will look the same.
ggplot(students) +
  geom_bar(aes(x = superpower, fill = school_year), stat = "count") +
  scale_fill_brewer()
But what if we have data where the counting has already been calculated? For this, let’s create a table with a count column from our data:
# Requires dplyr for %>%, group_by() and summarise()
count_data <- students %>%
  group_by(superpower, school_year) %>%
  summarise(counts = n(), .groups = "drop") # one row per combination, with its count
If you try to plot this as it is, you’ll get an error:
# try with no count used
ggplot(count_data) +
  geom_bar(aes(x = superpower, y = counts))
This is because it doesn’t know what to do if you’ve already got count data. In this case, we need to specify that we use no statistical transformation in geom_bar. We do this by setting stat = "identity" which translates to: plot the data as is.
ggplot(count_data) +
  geom_bar(aes(x = superpower, y = counts), stat = "identity")
Alternatively, you can use geom_col, which is the same as geom_bar but with no statistical transformation by default.
ggplot(count_data) +
  geom_col(aes(x = superpower, y = counts))
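As a quick check that the two approaches agree, the sketch below builds both plots from a small table of made-up counts and compares the bar heights ggplot2 computes.

```r
library(ggplot2)

# Hypothetical pre-counted data
counts_df <- data.frame(
  superpower = c("Fly", "Invisibility", "Telepathy"),
  counts     = c(12, 7, 5)
)

p_identity <- ggplot(counts_df) +
  geom_bar(aes(x = superpower, y = counts), stat = "identity")

p_col <- ggplot(counts_df) +
  geom_col(aes(x = superpower, y = counts))

# Both layers should place the top of each bar at the supplied count
same_heights <- isTRUE(all.equal(
  as.numeric(layer_data(p_identity)$ymax),
  as.numeric(layer_data(p_col)$ymax)
))
same_heights
```

geom_col() is usually the tidier choice when the data is already summarised.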
The final thing we’ll cover in our basic plot is labels, which are an important part of making our plots easy to understand. By default, R will fill in labels based on the names from the data. However, you will often want to overwrite these using labs. You can specify x and y for the x and y axis labels. You can add a title and subtitle with title and subtitle respectively. You can also change the title of any legend by giving the aesthetic name.
If you want to add more space you can include a newline indicator: “\n”.
ggplot(students) +
  geom_bar(aes(x = superpower, fill = school_year)) +
  labs(
    x = "\nSuperpower",
    y = "Count",
    title = "Preferred Superpower by School Year",
    subtitle = "Data from students around the world\n",
    fill = "School Year"
  ) +
  scale_fill_brewer()
Note that you can use the xlab, ylab, and ggtitle functions instead:
ggplot(students) +
  geom_bar(aes(x = superpower, fill = school_year)) +
  xlab("\nSuperpower") +
  ylab("Count") +
  ggtitle("Preferred Superpower by School Year",
          subtitle = "Data from students around the world\n") +
  labs(fill = "School Year") +
  scale_fill_brewer()
And there you have a plot that is at least informative and accurately displays the data.
Task - 10 mins
Now it’s your turn. Take this subset of the olympics_overall_medals data from the CodeClanData package. It shows the top 10 countries with the most medals.
top_10 <- olympics_overall_medals %>%
  arrange(desc(count)) %>%
  top_n(10)
## Selecting by count
top_10
Create an informative plot that plots the count of medals by team. Write down an explanation of what the plot shows.
Answer
ggplot(top_10) +
  geom_bar(aes(x = team, y = count), fill = "gold", stat = "identity") +
  coord_flip() +
  labs(
    y = "Number of Medals",
    x = "Team",
    title = "Top 10 teams for all time Gold Medal count"
  )
Plots in ggplot (like ogres) have layers. Each bar plot we created above has only a single layer, but in ggplot2 you can build up plots with many geoms.
Let’s have an example. We have a lot of chickens, and 4 types of chicken feed. The chickens are split into 4 groups, each group is fed one of these types of feed, and each chicken is weighed regularly. We want to show how the weights of these chickens increase over time.
Below is an example of a one-layer ggplot that visualises our chicken data.
data("ChickWeight") # ChickWeight is a standard built-in dataset available in base R
head(ChickWeight)
ggplot(ChickWeight) +
  geom_line(
    aes(x = Time, y = weight, group = Chick, colour = Diet))
Our first layer shows how chicken weights change over time, on different types of diets, in the form of lines. Now we add a second layer which identifies the actual observations. This one will use points as its geometric object. The structure of the call is very similar to the first layer.
ggplot(ChickWeight) +
  geom_line(
    aes(x = Time, y = weight, group = Chick, colour = Diet)) +
  geom_point(
    aes(x = Time, y = weight, colour = Diet))
Now we have our individual observations and our lines. As a final layer, we add a smoothed trend line and confidence band for each group. These statistics are automatically calculated by the geom_smooth() function. You can alter the method used with the method argument; see ?geom_smooth for details.
ggplot(ChickWeight) +
  geom_line(
    aes(x = Time, y = weight, group = Chick, colour = Diet),
    alpha = 0.25
  ) +
  geom_point(
    aes(x = Time, y = weight, colour = Diet),
    alpha = 0.5
  ) +
  geom_smooth(
    aes(x = Time, y = weight, colour = Diet)
  )
Note: the alpha argument sets the transparency of the geoms you are plotting
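As an example of changing the smoothing method, the sketch below fits a straight line per diet instead of the default smoother. The choice of method = "lm" and se = FALSE (which hides the confidence band) is just one possible setting, shown for illustration.

```r
library(ggplot2)

# One straight-line fit per diet, without the confidence band
p_lm <- ggplot(ChickWeight) +
  geom_smooth(
    aes(x = Time, y = weight, colour = Diet),
    method = "lm",
    formula = y ~ x,
    se = FALSE
  )

p_lm
```

A linear fit is easier to explain than the default smoother, but it can hide any curvature in how the weights grow.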
So now we have one plot, with three layers. However, there is some redundancy in this code. We are using the same aesthetics for almost every layer. Any aesthetics that apply to every layer can be placed either inside ggplot or just after.
ggplot(ChickWeight) +
  aes(x = Time, y = weight, colour = Diet) +
  geom_line(aes(group = Chick), alpha = 0.25) +
  geom_point(alpha = 0.5) +
  geom_smooth()
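The same shared aesthetics can also go inside ggplot() itself, and an individual layer can still override what it inherits. The following is a sketch of both ideas; the grey colour is an arbitrary choice for the example.

```r
library(ggplot2)

# Shared aesthetics placed inside ggplot(): every layer inherits them
p_inside <- ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet)) +
  geom_line(aes(group = Chick), alpha = 0.25) +
  geom_point(alpha = 0.5)

# A fixed colour set in one layer overrides the inherited colour mapping
p_override <- ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet)) +
  geom_line(aes(group = Chick), alpha = 0.25) +
  geom_point(colour = "grey40", alpha = 0.5)

# Number of distinct point colours in each plot (layer 2 is geom_point)
length(unique(layer_data(p_inside, 2)$colour))
length(unique(layer_data(p_override, 2)$colour))
```

In the second plot the lines are still coloured by diet, while the points are all grey.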
This shows how powerful ggplot is: with the same syntax, you can add multiple layers to your plots easily.
Task 1 - 10 mins
Using the students dataset:
- Use geom_point to make a scatter graph, with the height of students on the x-axis and their reaction time on the y-axis.
- Make all the points blue. For geom_bar, the colour of the bar is controlled by fill, but for geom_point the colour of the points is controlled by colour.
- Make the colour of the points depend on the superpower the student wishes they had.
- Change the position of the plot to jitter. What do you see?
- Write down what the graph tells you overall.
Answer
ggplot(students) +
  geom_point(aes(x = height_cm, y = reaction_time))
ggplot(students) +
  geom_point(aes(x = height_cm, y = reaction_time), colour = "blue")
ggplot(students) +
  geom_point(aes(x = height_cm, y = reaction_time, colour = superpower))
ggplot(students) +
  geom_point(aes(x = height_cm, y = reaction_time, colour = superpower), position = "jitter")
Each point has been moved by a small random amount. This isn’t very useful in this plot, but is useful when you have many points on top of each other.
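If the default jitter moves points more than you would like, position_jitter() lets you set the amount in each direction and fix the random seed. The sketch below uses made-up data with deliberately overlapping points; the width of 0.5 and the seed are arbitrary choices.

```r
library(ggplot2)

# Hypothetical toy data with several identical points
toy <- data.frame(
  height_cm     = c(150, 150, 150, 160, 160),
  reaction_time = c(0.4, 0.4, 0.4, 0.5, 0.5)
)

# Jitter horizontally by up to 0.5, leave the y values untouched,
# and fix the seed so the plot looks the same every time
p_jitter <- ggplot(toy) +
  geom_point(
    aes(x = height_cm, y = reaction_time),
    position = position_jitter(width = 0.5, height = 0, seed = 42)
  )

# Positions actually used for drawing
jittered <- layer_data(p_jitter)
jittered[, c("x", "y")]
```

Keeping the jitter to one direction is handy when one of the axes holds values you want readers to be able to read off precisely.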
Task 2 - 10 mins
Use the dataset pets from the CodeClanData package to do the following:
Create a labelled scatter plot of pet age vs. weight.
- We want age on the x-axis and weight on the y-axis.
- We want the points to be different colours depending on the gender of the pet, and different shapes depending on the type of animal.
- We want all the points to be bigger than normal (size 4).
- We also want labels with the pets’ names next to every point.
Answer
ggplot(pets) +
  aes(x = age, y = weight) +
  geom_point(aes(colour = sex, shape = animal), size = 4) +
  geom_text(
    aes(label = name),
    nudge_x = 0.5,
    nudge_y = 0.1
  )
Finally, different layers can also use different datasets that are specified using the data argument in a geom. This is particularly useful if we want a geom to only plot a subset of the data. For example, here we are only labelling “Fluffy”.
ggplot(pets) +
  aes(x = age, y = weight) +
  geom_point(aes(colour = sex, shape = animal), size = 4) +
  geom_text(
    aes(label = name),
    nudge_x = 0.5,
    nudge_y = 0.1,
    data = subset(pets, name == "Fluffy")
  )
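The same pattern works with any data frame. As a sketch on the built-in mtcars data (an assumption made so the example runs without CodeClanData), here only the most fuel-efficient car is labelled.

```r
library(ggplot2)

# Copy mtcars and turn the row names into a proper column
cars <- mtcars
cars$model <- rownames(mtcars)

p_label <- ggplot(cars) +
  aes(x = wt, y = mpg) +
  geom_point() +
  geom_text(
    aes(label = model),
    nudge_y = 1,
    data = subset(cars, mpg == max(mpg)) # this layer only sees one row
  )

p_label
```

The point layer still uses all of the data, while the text layer draws a single label.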
We can also save the image generated using the ggsave function. This saves the last image by default.
ggplot(pets) +
  aes(x = age, y = weight) +
  geom_point(aes(colour = sex, shape = animal), size = 4) +
  geom_text(
    aes(label = name),
    nudge_x = 0.5,
    nudge_y = 0.1
  )
ggsave("g1_sav.pdf")
ggsave("g1_sav.png")
You can alter the size of the raster graphics using the “width” and “height” arguments.
The function recognises the file extension (e.g. .pdf or .png) and saves in the appropriate format. You can also use the export button in RStudio (at the top of the Plots pane).
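Here is a small sketch of saving a specific plot at a chosen size. It writes to a temporary folder so nothing is left behind in your project; the file name, the size and the dpi are arbitrary choices for the example.

```r
library(ggplot2)

p <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))

# Save to a temporary location; swap in your own path in practice
out_file <- file.path(tempdir(), "scatter.png")

# Passing plot = p saves that plot rather than whichever was drawn last
ggsave(out_file, plot = p, width = 6, height = 4, units = "in", dpi = 150)

file.exists(out_file)
```

Passing the plot explicitly avoids surprises when several plots have been drawn in a session.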
What is ggplot2’s approach to visualisation?
Answer
ggplot uses a grammar of graphics based approach
Answer
ggplot(data = df, aes(<default mappings>)) +
  geom_type(stat = "<statistic>", position = "<position-adjustment>",
            aes(<mappings specific to this layer>), <hardcoded-aesthetics>) +
  ... +
  geom_type(stat = "<statistic>", position = "<position-adjustment>",
            aes(<mappings specific to this layer>), <hardcoded-aesthetics>)
Links of where else to look